LightScan: Faster Scan Primitive on CUDA Compatible Manycore Processors

Authors

  • Yongchao Liu
  • Srinivas Aluru
Abstract

Scan (or prefix sum) is a fundamental and widely used primitive in parallel computing. In this paper, we present LightScan, a faster parallel scan primitive for CUDA-enabled GPUs, which uses a hybrid model combining intra-block computation and inter-block communication to perform a scan. Our algorithm employs warp shuffle functions to implement fast intra-block computation and takes advantage of the globally coherent L2 cache and the associated Parallel Thread Execution (PTX) assembly instructions to realize lightweight inter-block communication. Performance evaluation on a single Tesla K40c GPU shows that LightScan outperforms existing GPU algorithms and implementations, yielding speedups of up to 2.1, 2.4, 1.5 and 1.2 over the leading CUDPP, Thrust, ModernGPU and CUB implementations, respectively, running on the same GPU. Furthermore, LightScan runs up to 8.9 and 257.3 times faster than Intel TBB running on 16 CPU cores and on an Intel Xeon Phi 5110P coprocessor, respectively. The source code of LightScan is available at http://cupbb.sourceforge.net.
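To make the two-phase structure in the abstract concrete, here is a minimal sequential C++ sketch of a block-wise inclusive scan. It is only a CPU model of the idea, not the LightScan implementation: the function name `blocked_inclusive_scan` and the `block_size` parameter are hypothetical, the inner loop stands in for the intra-block computation that the paper performs with warp shuffles, and the `carry` variable stands in for the inter-block communication that the paper performs through the L2 cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): inclusive prefix sum computed
// block by block, as a sequential model of a hybrid scan.
std::vector<long long> blocked_inclusive_scan(const std::vector<long long>& in,
                                              std::size_t block_size) {
    assert(block_size > 0);  // a block must hold at least one element
    std::vector<long long> out(in.size());

    // Sum of all elements in the blocks already processed. On a GPU this
    // value would be passed between thread blocks; here it is a local.
    long long carry = 0;

    for (std::size_t start = 0; start < in.size(); start += block_size) {
        const std::size_t end = std::min(start + block_size, in.size());

        // Phase 1: scan inside the block, ignoring everything before it.
        long long local = 0;
        for (std::size_t i = start; i < end; ++i) {
            local += in[i];
            out[i] = local;
        }

        // Phase 2: add the total of all preceding blocks to each element.
        for (std::size_t i = start; i < end; ++i) {
            out[i] += carry;
        }

        // Publish this block's total for the blocks that follow.
        carry += local;
    }
    return out;
}
```

For example, calling `blocked_inclusive_scan({1, 2, 3, 4, 5, 6, 7}, 3)` scans the blocks `{1, 2, 3}`, `{4, 5, 6}` and `{7}` separately and then offsets each by the running total of the blocks before it, giving the same result as a plain prefix sum over the whole input.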

Similar resources

SPAP: A Programming Language for Heterogeneous Many-Core Systems

We present SPAP (Same Program for All Processors), a container-based programming language for heterogeneous many-core systems. SPAP abstracts away processor-specific concurrency and performance concerns using containers. Each SPAP container is a high level primitive with an STL-like interface. The programmer-visible behavior of the container is consistent with its sequential counterpart, which en...

The nonequispaced FFT on graphics processing units

Without doubt, the fast Fourier transform (FFT) belongs to the algorithms with large impact on science and engineering. By appropriate approximations, this scheme has been generalized for arbitrary spatial sampling points. This so-called nonequispaced FFT is the core of the sequential NFFT3 library and we discuss its computational costs in detail. On the other hand, programmable graphics proces...

Sorting using BItonic netwoRk wIth CUDA

Novel “manycore” architectures, such as graphics processors, are high-parallel and high-performance shared-memory architectures [7] born to solve specific problems such as the graphical ones. Those architectures can be exploited to solve a wider range of problems by designing the related algorithm for such architectures. We present a fast sorting algorithm implementing an efficient bitonic sort...

Spatial Scan Statistics on the GPGPU

Kulldorff’s spatial scan statistic and the software implementation (SaTScan) are widely used for the detection and evaluation of geographic clusters, particularly within the health care community. Unfortunately, the computational time of the scan statistic depends on a wide variety of variables, and, depending on the chosen parameter settings and operations, the computational time can be on the...

Energy Introspector: Simulation Infrastructure for Power, Temperature, and Reliability Modeling in Manycore Processors

This paper presents an architecture-independent modeling infrastructure called the Energy Introspector for estimating non-functional aspects of processors such as energy, power, temperature, area, delay, sensor, and reliability. The Energy Introspector supports processor modeling through the integration of various modeling tools. It features structural abstraction of physical and microarchitectu...

Journal:
  • CoRR

Volume: abs/1604.04815

Publication date: 2016